2 Preparing Data for Tables
Now that we’ve covered the main types of variables, we can start thinking about how to prepare and visually represent data with figures and tables. In this section, we will focus on preparing data for tables.
In R, datasets are stored as dataframes, which are objects made up of rows and columns. Each row corresponds to an individual observation, while each column represents the variable that is being measured. There are several ways R stores variables: character for text, factor for categorical, integer for whole numbers, numeric for decimals, and logical for booleans (TRUE/FALSE).
Cleaning data can be the most time-consuming, yet arguably the most important, part of analysing data! Generally, it is
- wide format to long format
To understand R, we will examine these concepts using real-world datasets.
NAPLAN
The first dataset to be explored is the NAPLAN dataset. NAPLAN stands for the National Assessment Program - Literacy and Numeracy, and is an annual national assessment for all students in year 3, 5, 7, and 9. The assessment is designed to evaluate whether students are building essential literacy and numeracy skills (https://www.nap.edu.au/naplan/faqs/naplan–general).
To load in the dataset, we can use the read.csv() function.
naplan <- read.csv("Data-sets/naplan_reading.csv")Now that we have the dataset loaded in, let’s take a look at the variables in the dataset using str().
str(naplan)'data.frame': 3000 obs. of 11 variables:
$ student_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ school_id : int 51 26 14 21 33 26 37 41 8 60 ...
$ grade : chr "Year 7" "Year 3" "Year 7" "Year 3" ...
$ reading_time_home : int 70 70 30 0 20 25 5 90 45 10 ...
$ parent_education : chr "Certificate/Diploma" "Year 12" "Bachelor degree" "Bachelor degree" ...
$ school_type : chr "Government" "Government" "Government" "Independent" ...
$ gender : chr "Female" "Female" "Male" "Female" ...
$ birth_months : chr "Aug" "Apr" "Jul" "Sep" ...
$ n_siblings : int 2 1 1 1 0 1 3 2 1 3 ...
$ ses_index : num 0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
$ naplan_reading_score: int 724 439 631 491 635 592 449 484 730 528 ...
In the top line, we can see that there are 3000 observations and 11 variables.
We can confirm this using the dim() function
dim(naplan)[1] 3000 11
Using the names() function, we can see the variable names.
names(naplan) [1] "student_id" "school_id" "grade"
[4] "reading_time_home" "parent_education" "school_type"
[7] "gender" "birth_months" "n_siblings"
[10] "ses_index" "naplan_reading_score"
The first column gives us the variables, the second column gives us the type of variable, and the third column gives us individual observations for each row. However, we can see school_type is stored as a chr or character (text) variable. Text variables are very common when you encounter surveys that have customer feedback, for example, as each response will be unique from one another.
It is important to check that your variables are stored correctly. We can see certain variables like reading time at home is an integer, which matches our expectations.
You can use the unique() function to see the unique groupings or labels within a categorical variable. In our case, we can use it to confirm if our suspicions that school_type is categorical. To call a specific variable, we call the dataset followed by $ and then the variable name.
unique(naplan$school_type)[1] "Government" "Independent" "Catholic"
There are only 3 unique labels, which strongly suggests that school type is a nominal categorical variable. If it were truly a text variable, we might see thousands of unique labels.
We can manually convert the variable to categorical using as.factor():
naplan$school_type <- as.factor(naplan$school_type)str(naplan)'data.frame': 3000 obs. of 11 variables:
$ student_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ school_id : int 51 26 14 21 33 26 37 41 8 60 ...
$ grade : chr "Year 7" "Year 3" "Year 7" "Year 3" ...
$ reading_time_home : int 70 70 30 0 20 25 5 90 45 10 ...
$ parent_education : chr "Certificate/Diploma" "Year 12" "Bachelor degree" "Bachelor degree" ...
$ school_type : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
$ gender : chr "Female" "Female" "Male" "Female" ...
$ birth_months : chr "Aug" "Apr" "Jul" "Sep" ...
$ n_siblings : int 2 1 1 1 0 1 3 2 1 3 ...
$ ses_index : num 0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
$ naplan_reading_score: int 724 439 631 491 635 592 449 484 730 528 ...
Now we can see school_type is a factor with 3 levels (or 3 unique labels).
Alternatively, we can also convert all strings (text) to categorical variables with the stringsAsFactors argument when we read in the dataset.
naplan <- read.csv("Data-sets/naplan_reading.csv",
stringsAsFactors = TRUE)str(naplan)'data.frame': 3000 obs. of 11 variables:
$ student_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ school_id : int 51 26 14 21 33 26 37 41 8 60 ...
$ grade : Factor w/ 4 levels "Year 3","Year 5",..: 3 1 3 1 4 4 1 3 4 3 ...
$ reading_time_home : int 70 70 30 0 20 25 5 90 45 10 ...
$ parent_education : Factor w/ 5 levels "Bachelor degree",..: 2 5 1 1 5 4 4 1 4 4 ...
$ school_type : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 1 2 ...
$ birth_months : Factor w/ 12 levels "Apr","Aug","Dec",..: 2 1 6 12 3 10 2 12 11 8 ...
$ n_siblings : int 2 1 1 1 0 1 3 2 1 3 ...
$ ses_index : num 0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
$ naplan_reading_score: int 724 439 631 491 635 592 449 484 730 528 ...
As seen above, all the chr variables have been converted to Factor, which is how the variables should be stored. Note that there might be times where you need a variable stored as chr, so you should always be sure of what you’re checking.
We can now attach the dataset. Attaching the dataset means we can call variables directly without using $. For instance, we can call school_type directly as opposed to naplan$school_type.
attach(naplan)If we type naplan directly in, we can see the content.
naplanAlternatively, we can use the View() function to open up a new tab in RStudio with the dataset.
View(naplan)We can also select parts of the dataframe. In R, we subset by calling the dataset followed by [,] . Writing a blank space before the comma indicates that all rows are includes, while a blank space after the comma indicates that all columns are included. For example, [,3] selects the third column of a dataframe. If we want to select the first three columns, we can write:
# Select the first three rows and three columns of the Naplan dataset
naplan[1:3,1:3] student_id school_id grade
1 1 51 Year 7
2 2 26 Year 3
3 3 14 Year 7
Perhaps we only want to select the middle rows for all the columns of the dataframe.
# Select the fifth and sixth row of the Naplan dataset with
naplan[5:6,] student_id school_id grade reading_time_home parent_education school_type
5 5 33 Year 9 20 Year 12 Government
6 6 26 Year 9 25 Year 10 or below Government
gender birth_months n_siblings ses_index naplan_reading_score
5 Male Dec 0 0.27 635
6 Female Nov 1 -0.95 592
We can even select rows based on logical tests on the values of one or more variables.
# Select all rows (observations) that have a reading time at
# home greater than 30 mins and a naplan reading score greater than 800
naplan[reading_time_home > 30 & naplan_reading_score > 860,] student_id school_id grade reading_time_home parent_education
493 493 1 Year 9 65 Year 12
584 584 1 Year 9 40 Certificate/Diploma
741 741 42 Year 9 105 Year 12
1118 1118 42 Year 9 95 Certificate/Diploma
1168 1168 31 Year 9 60 Postgraduate
1174 1174 35 Year 9 65 Year 12
1447 1447 23 Year 9 75 Year 12
1792 1792 5 Year 9 70 Year 12
1906 1906 5 Year 9 70 Year 10 or below
1952 1952 35 Year 9 75 Year 12
2122 2122 5 Year 9 55 Postgraduate
2359 2359 35 Year 9 60 Bachelor degree
2425 2425 47 Year 9 50 Postgraduate
school_type gender birth_months n_siblings ses_index naplan_reading_score
493 Independent Male Aug 0 1.05 900
584 Independent Female Sep 1 0.68 900
741 Government Male Aug 5 -0.44 900
1118 Government Male Jul 2 -0.28 883
1168 Catholic Female Aug 2 2.19 871
1174 Government Male Aug 2 -0.33 900
1447 Independent Male Sep 2 0.30 900
1792 Government Male Aug 1 0.52 900
1906 Government Male Jan 5 -0.71 887
1952 Government Male Aug 1 -0.17 900
2122 Government Male Jul 3 1.87 900
2359 Government Female Dec 5 1.33 900
2425 Independent Female Aug 2 2.02 866
We can also sort the rows and the columns in the dataset. By default, R will sort in ascending order.
# Order the naplan reading score (ascending)
naplan[order(naplan$naplan_reading_score), ][1:10, ] student_id school_id grade reading_time_home parent_education
2511 2511 13 Year 3 85 Year 12
2495 2495 18 Year 3 70 Year 12
1158 1158 57 Year 3 60 Postgraduate
1053 1053 18 Year 3 70 Postgraduate
617 617 13 Year 7 100 Certificate/Diploma
2349 2349 18 Year 3 80 Year 12
1207 1207 18 Year 3 50 Year 10 or below
138 138 57 Year 3 35 Year 12
1697 1697 18 Year 5 55 Year 10 or below
467 467 9 Year 3 50 Year 10 or below
school_type gender birth_months n_siblings ses_index naplan_reading_score
2511 Independent Female Dec 3 0.67 177
2495 Government Female Jan 2 -0.15 270
1158 Catholic Female Jan 1 1.80 273
1053 Government Female Sep 0 1.99 288
617 Independent Male Aug 1 1.45 300
2349 Government Male Aug 1 -0.10 303
1207 Government Female Jan 2 -2.15 309
138 Catholic Female Mar 2 0.24 312
1697 Government Male Oct 2 -1.12 312
467 Catholic Male Aug 2 -0.45 315
We can also sort and select odd-numbered columns.
# Order the naplan reading score (ascending)
# and select only odd-numbered columns
naplan[order(naplan_reading_score), c(1, 3, 5, 7, 9, 11)][1:3,] student_id grade parent_education gender n_siblings naplan_reading_score
2511 2511 Year 3 Year 12 Female 3 177
2495 2495 Year 3 Year 12 Female 2 270
1158 1158 Year 3 Postgraduate Female 1 273
Or descending order.
# Order the naplan reading score (descending)
# and select only odd-numbered columns
naplan[rev(order(naplan_reading_score)), c(1, 3, 5, 7, 9, 11)][1:3,] student_id grade parent_education gender n_siblings naplan_reading_score
2359 2359 Year 9 Bachelor degree Female 5 900
2122 2122 Year 9 Postgraduate Male 3 900
1952 1952 Year 9 Year 12 Male 1 900
We can summarise the content of the dataframe as well.
summary(naplan) student_id school_id grade reading_time_home
Min. : 1.0 Min. : 1.00 Year 3:735 Min. : 0.00
1st Qu.: 750.8 1st Qu.:18.00 Year 5:771 1st Qu.: 5.00
Median :1500.5 Median :32.00 Year 7:745 Median : 15.00
Mean :1500.5 Mean :31.90 Year 9:749 Mean : 21.62
3rd Qu.:2250.2 3rd Qu.:45.25 3rd Qu.: 30.00
Max. :3000.0 Max. :60.00 Max. :120.00
parent_education school_type gender birth_months
Bachelor degree :771 Catholic : 540 Female:1451 Dec : 325
Certificate/Diploma:701 Government :2023 Male :1549 Aug : 314
Postgraduate :297 Independent: 437 Jan : 300
Year 10 or below :466 Sep : 299
Year 12 :765 Jul : 296
Nov : 296
(Other):1170
n_siblings ses_index naplan_reading_score
Min. :0.000 Min. :-2.3200 Min. :177.0
1st Qu.:1.000 1st Qu.:-0.1100 1st Qu.:491.0
Median :2.000 Median : 0.5300 Median :562.0
Mean :1.899 Mean : 0.5246 Mean :558.5
3rd Qu.:3.000 3rd Qu.: 1.1500 3rd Qu.:618.2
Max. :5.000 Max. : 3.4100 Max. :900.0
You might notice that the summary doesn’t quite work for categorical variables.
We can use tapply() and with() to summarise a quantitative variable by a qualitative variable.
with(naplan, tapply(naplan_reading_score, grade, mean)) Year 3 Year 5 Year 7 Year 9
457.1687 539.7821 594.5785 641.1722
aggregate() allows you to summarise by one or more categorical variables, while tapply can only summarise one variable at a time.
aggregate(naplan[,c(4,11)], list(grade),mean) Group.1 reading_time_home naplan_reading_score
1 Year 3 21.59184 457.1687
2 Year 5 21.83528 539.7821
3 Year 7 22.24832 594.5785
4 Year 9 20.80774 641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade),mean) Grade reading_time_home naplan_reading_score
1 Year 3 21.59184 457.1687
2 Year 5 21.83528 539.7821
3 Year 7 22.24832 594.5785
4 Year 9 20.80774 641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade, "Parent Education" = parent_education),mean) Grade Parent Education reading_time_home naplan_reading_score
1 Year 3 Bachelor degree 21.35000 478.5500
2 Year 5 Bachelor degree 19.63054 564.7783
3 Year 7 Bachelor degree 21.15591 618.3065
4 Year 9 Bachelor degree 18.48901 653.3187
5 Year 3 Certificate/Diploma 18.17919 457.6127
6 Year 5 Certificate/Diploma 21.78977 536.1705
7 Year 7 Certificate/Diploma 23.15217 595.2554
8 Year 9 Certificate/Diploma 22.44048 642.1786
9 Year 3 Postgraduate 20.06494 487.3117
10 Year 5 Postgraduate 20.19481 575.4026
11 Year 7 Postgraduate 23.10606 628.7727
12 Year 9 Postgraduate 17.20779 684.5325
13 Year 3 Year 10 or below 22.47475 431.2323
14 Year 5 Year 10 or below 25.49587 505.1983
15 Year 7 Year 10 or below 21.30435 558.1391
16 Year 9 Year 10 or below 21.48855 609.4809
17 Year 3 Year 12 25.18817 435.0914
18 Year 5 Year 12 22.55155 524.3351
19 Year 7 Year 12 22.70619 581.1546
20 Year 9 Year 12 22.56545 632.9686
Exercise: Palmer Penguins
Photo by Derek Oyen on Unsplash
First, we will need to load in some data into R. We will be working with the palmerpenguins package. The palmerpenguins package is a dataset collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
If you have not used this package before, you will need to install it first.
install.packages("palmerpenguins")Once installed, you will need to load the package into R.
# Load in the `palmerpenguins` package
library(palmerpenguins)
# Load in the `palmerpenguins` package
library(palmerpenguins)Now that we have the package installed, let’s take a look at the variables in the dataset using str() on the penguins dataset.
# Call str() on `penguins`
str(penguins)
# Call str() on `penguins`
str(penguins)You might notice that some values are listed as NA. This means there are missing observations. To check for missing values, try using the embedded any(is.na()) function.
# Try using any(is.na()) on the `penguins` dataset.
any(is.na(penguins))
# Try using any(is.na()) on the `penguins` dataset.
any(is.na(penguins))Since this returns TRUE, we know there are missing values in the dataset. Let’s find out how many.
# Try using sum(is.na()) on the `penguins` dataset.
sum(is.na(penguins))
# Try using sum(is.na()) on the `penguins` dataset.
sum(is.na(penguins))There are missing observations. This can cause problems when generating numerical summaries or running statistical tests. At this stage, to handle this, we can remove the missing values. It’s always best practice to store your cleaned data in a new dataset and leave the original untouched. Note that you should always be careful when removing observations, as even partially filled observations can still provide valuable information.
clean_penguins <- na.omit(penguins)
clean_penguins <- na.omit(penguins)clean_penguins <- na.omit(penguins)We can check the structure again to make sure everything looks good:
str(clean_penguins)tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
$ bill_depth_mm : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
$ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
$ body_mass_g : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
$ year : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
- attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
We can see that there are eight variables. Three of these are categorical: species, island, and sex. These are stored as factors, where the levels represent the number of distinct labels or groups within each variable. For example, sex has two levels: male and female.
You’ll notice that there are four numerical variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and year. If your instinct is that year doesn’t quite fit as right as a numerical variable, you would be right! It is better to treat year as a categorical variable.
How do we manually convert year to a factor?
Now if we run class(), we can see that year is now correctly stored as a categorical variable with three levels.
class(clean_penguins$year)[1] "factor"
How can we check the unique groupings or labels within year?
unique(clean_penguins$year)
unique(clean_penguins$year)This shows that the data was collected over the years , , and (ascending order).
You can also see from the structure output str() that the penguins dataset is stored as a 333 x 8 tibble, which is a type of data frame in R. This means that there are 333 rows (observations) and 8 columns (variables).
Check the dimensions of the cleaned dataset to confirm this.
dim(clean_penguins)
dim(clean_penguins)If you wish to have a closer look at inspecting the penguins dataset, you can use the View() function to open it in a new tab on RStudio.
View(penguins)
View(clean_penguins)Tables
Now that we’ve inspected the dataset, we can start thinking about how to visually represent and understand the data.
As we have learnt
library(gt)
library(tidyverse)
library(kableExtra)naplan %>%
dplyr::filter(reading_time_home > 30 & naplan_reading_score > 800) %>%
gt() %>%
tab_header(title = "Naplan") %>%
fmt_number(decimals=2) %>%
cols_label(
student_id = md("***Student ID***"),
school_id = md("***School ID***"),
grade = md("***Grade***"),
reading_time_home = md("***Reading Time***"),
parent_education = md("***Parent Education***"),
school_type = md("***School Type***"),
gender = md("***Gender***"),
birth_months = md("***Birth Month***"),
n_siblings = md("***Number of Siblings***"),
ses_index = md("***SES Index***"),
naplan_reading_score = md("***Reading Score***")
)| Naplan | ||||||||||
Student ID |
School ID |
Grade |
Reading Time |
Parent Education |
School Type |
Gender |
Birth Month |
Number of Siblings |
SES Index |
Reading Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 493.00 | 1.00 | Year 9 | 65.00 | Year 12 | Independent | Male | Aug | 0.00 | 1.05 | 900.00 |
| 584.00 | 1.00 | Year 9 | 40.00 | Certificate/Diploma | Independent | Female | Sep | 1.00 | 0.68 | 900.00 |
| 626.00 | 35.00 | Year 9 | 40.00 | Bachelor degree | Government | Female | Dec | 2.00 | 1.16 | 804.00 |
| 741.00 | 42.00 | Year 9 | 105.00 | Year 12 | Government | Male | Aug | 5.00 | −0.44 | 900.00 |
| 1,118.00 | 42.00 | Year 9 | 95.00 | Certificate/Diploma | Government | Male | Jul | 2.00 | −0.28 | 883.00 |
| 1,168.00 | 31.00 | Year 9 | 60.00 | Postgraduate | Catholic | Female | Aug | 2.00 | 2.19 | 871.00 |
| 1,174.00 | 35.00 | Year 9 | 65.00 | Year 12 | Government | Male | Aug | 2.00 | −0.33 | 900.00 |
| 1,447.00 | 23.00 | Year 9 | 75.00 | Year 12 | Independent | Male | Sep | 2.00 | 0.30 | 900.00 |
| 1,792.00 | 5.00 | Year 9 | 70.00 | Year 12 | Government | Male | Aug | 1.00 | 0.52 | 900.00 |
| 1,906.00 | 5.00 | Year 9 | 70.00 | Year 10 or below | Government | Male | Jan | 5.00 | −0.71 | 887.00 |
| 1,952.00 | 35.00 | Year 9 | 75.00 | Year 12 | Government | Male | Aug | 1.00 | −0.17 | 900.00 |
| 2,122.00 | 5.00 | Year 9 | 55.00 | Postgraduate | Government | Male | Jul | 3.00 | 1.87 | 900.00 |
| 2,359.00 | 35.00 | Year 9 | 60.00 | Bachelor degree | Government | Female | Dec | 5.00 | 1.33 | 900.00 |
| 2,425.00 | 47.00 | Year 9 | 50.00 | Postgraduate | Independent | Female | Aug | 2.00 | 2.02 | 866.00 |
| 2,691.00 | 55.00 | Year 9 | 80.00 | Year 12 | Government | Female | Dec | 2.00 | −0.18 | 837.00 |
| 2,712.00 | 31.00 | Year 9 | 60.00 | Year 12 | Catholic | Male | Dec | 2.00 | 0.81 | 809.00 |
| 2,756.00 | 53.00 | Year 9 | 55.00 | Year 10 or below | Government | Female | Mar | 1.00 | −0.40 | 840.00 |
naplan %>%
filter(gender == "Female", school_type == "Independent") %>%
arrange(desc(naplan_reading_score)) %>%
slice_head(n = 10) %>%
gt() %>%
tab_header(title = "Naplan") %>%
fmt_number(decimals = 2) %>%
cols_label(
student_id = md("***Student ID***"),
school_id = md("***School ID***"),
grade = md("***Grade***"),
reading_time_home = md("***Reading Time***"),
parent_education = md("***Parent Education***"),
school_type = md("***School Type***"),
gender = md("***Gender***"),
birth_months = md("***Birth Month***"),
n_siblings = md("***Number of Siblings***"),
ses_index = md("***SES Index***"),
naplan_reading_score = md("***Reading Score***")
) %>%
data_color(
columns = naplan_reading_score,
method = "numeric",
palette = "viridis",
domain = c(0, 1000),
reverse = TRUE
) %>%
tab_style(
style = list(
cell_fill(color = "gray95"),
cell_borders(sides = c("l", "r"), color = "gray50", weight = px(3))
),
locations = cells_body(columns = -naplan_reading_score)
) %>%
tab_style(
style = cell_fill(color = "gray98"),
locations = cells_title()
) %>%
tab_style(
style = list(cell_fill(color = "gray35"), cell_text(color = "white")),
locations = list(cells_footnotes(), cells_source_notes())
) %>%
tab_style(
style = cell_text(weight = "bold"),
locations = cells_title(groups = "title")
) %>%
tab_style(
style = cell_borders(
sides = c("t", "b"),
color = "darkgrey",
weight = px(3)
),
locations = list(cells_column_labels(), cells_stubhead())
) %>%
tab_footnote(
footnote = "SES Index represents ",
locations = cells_column_labels(columns = ses_index)
)| Naplan | ||||||||||
Student ID |
School ID |
Grade |
Reading Time |
Parent Education |
School Type |
Gender |
Birth Month |
Number of Siblings |
SES Index 1 |
Reading Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 584.00 | 1.00 | Year 9 | 40.00 | Certificate/Diploma | Independent | Female | Sep | 1.00 | 0.68 | 900.00 |
| 2,425.00 | 47.00 | Year 9 | 50.00 | Postgraduate | Independent | Female | Aug | 2.00 | 2.02 | 866.00 |
| 1,007.00 | 1.00 | Year 9 | 25.00 | Bachelor degree | Independent | Female | Apr | 0.00 | 1.09 | 831.00 |
| 887.00 | 1.00 | Year 7 | 50.00 | Bachelor degree | Independent | Female | Jul | 1.00 | 0.85 | 800.00 |
| 2,166.00 | 2.00 | Year 7 | 60.00 | Bachelor degree | Independent | Female | May | 1.00 | 2.40 | 800.00 |
| 2,610.00 | 23.00 | Year 9 | 35.00 | Bachelor degree | Independent | Female | Oct | 1.00 | 1.44 | 796.00 |
| 305.00 | 1.00 | Year 9 | 35.00 | Certificate/Diploma | Independent | Female | Apr | 3.00 | 0.46 | 788.00 |
| 1,220.00 | 47.00 | Year 9 | 35.00 | Certificate/Diploma | Independent | Female | Jul | 5.00 | 0.76 | 782.00 |
| 133.00 | 47.00 | Year 9 | 55.00 | Year 10 or below | Independent | Female | Sep | 3.00 | 0.36 | 780.00 |
| 1,383.00 | 47.00 | Year 9 | 30.00 | Certificate/Diploma | Independent | Female | Sep | 2.00 | 1.53 | 774.00 |
| 1 SES Index represents | ||||||||||
naplan |>
dplyr::group_by(gender) |>
dplyr::summarise(
mean_score = mean(naplan_reading_score, na.rm = TRUE),
mean_ses = mean(ses_index, na.rm = TRUE),
.groups = "drop"
) |>
tidyr::pivot_wider(
names_from = gender,
values_from = c(mean_score, mean_ses)
) |>
gt() |>
fmt_number(columns = everything(), decimals = 2) |>
tab_spanner(
label = "Female",
columns = matches("_Female$")
) |>
tab_spanner(
label = "Male",
columns = matches("_Male$")
) |>
cols_label(
matches("mean_score") ~ "Reading Score<br>Average",
matches("mean_ses") ~ "SES Index<br>Average",
.fn=md
) |>
cols_width(everything() ~ px(120))| Female | Male | ||
|---|---|---|---|
Reading Score |
SES Index |
Reading Score |
SES Index |
| 563.00 | 0.50 | 554.21 | 0.55 |
library(gtsummary)naplan |>
tbl_summary(
by = gender,
include=c(grade, reading_time_home, naplan_reading_score),
statistic=list(all_continuous()~"{mean} ({sd})"),
digits=list(naplan_reading_score = c(0,1),
reading_time_home = c(0,1))
)Characteristic |
Female |
Male |
|---|---|---|
| grade | ||
| Year 3 | 350 (24%) | 385 (25%) |
| Year 5 | 378 (26%) | 393 (25%) |
| Year 7 | 356 (25%) | 389 (25%) |
| Year 9 | 367 (25%) | 382 (25%) |
| reading_time_home | 22 (22.9) | 21 (22.7) |
| naplan_reading_score | 563 (95.7) | 554 (92.0) |
| 1
n (%); Mean (SD) |
||
naplan |>
tbl_summary(
include=c(reading_time_home, naplan_reading_score),
type=all_continuous() ~ "continuous2",
statistic=list(all_continuous()~c("{median} ({p25}, {p75})", "{min}, {max}"))
)Characteristic |
N = 3,000 |
|---|---|
| reading_time_home | |
| Median (Q1, Q3) | 15 (5, 30) |
| Min, Max | 0, 120 |
| naplan_reading_score | |
| Median (Q1, Q3) | 562 (491, 619) |
| Min, Max | 177, 900 |
naplan %>%
filter(gender == "Female", school_type == "Independent") %>%
arrange(desc(naplan_reading_score)) %>%
slice_head(n = 10) %>%
select(
`Student ID` = student_id,
`School ID` = school_id,
Grade = grade,
`Reading Time` = reading_time_home,
`Parent Education` = parent_education,
`School Type` = school_type,
Gender = gender,
`Birth Month` = birth_months,
`Number of Siblings` = n_siblings,
`SES Index` = ses_index,
`Reading Score` = naplan_reading_score
) %>%
mutate(
`Reading Score` = cell_spec(
`Reading Score`,
background = spec_color(`Reading Score`, option = "viridis", end = 0.95),
color = "white"
)
) %>%
kable(
format = "html",
align = "c",
booktabs = TRUE,
escape = FALSE,
caption = "Naplan"
) %>%
kable_styling(full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "gray35") %>%
column_spec(
1:10,
background = "gray95",
extra_css = "border-left: 3px solid gray50; border-right: 3px solid gray50;"
) %>% footnote(general = "SES Index represents ...", general_title = "")| Student ID | School ID | Grade | Reading Time | Parent Education | School Type | Gender | Birth Month | Number of Siblings | SES Index | Reading Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 584 | 1 | Year 9 | 40 | Certificate/Diploma | Independent | Female | Sep | 1 | 0.68 | 900 |
| 2425 | 47 | Year 9 | 50 | Postgraduate | Independent | Female | Aug | 2 | 2.02 | 866 |
| 1007 | 1 | Year 9 | 25 | Bachelor degree | Independent | Female | Apr | 0 | 1.09 | 831 |
| 887 | 1 | Year 7 | 50 | Bachelor degree | Independent | Female | Jul | 1 | 0.85 | 800 |
| 2166 | 2 | Year 7 | 60 | Bachelor degree | Independent | Female | May | 1 | 2.40 | 800 |
| 2610 | 23 | Year 9 | 35 | Bachelor degree | Independent | Female | Oct | 1 | 1.44 | 796 |
| 305 | 1 | Year 9 | 35 | Certificate/Diploma | Independent | Female | Apr | 3 | 0.46 | 788 |
| 1220 | 47 | Year 9 | 35 | Certificate/Diploma | Independent | Female | Jul | 5 | 0.76 | 782 |
| 133 | 47 | Year 9 | 55 | Year 10 or below | Independent | Female | Sep | 3 | 0.36 | 780 |
| 1383 | 47 | Year 9 | 30 | Certificate/Diploma | Independent | Female | Sep | 2 | 1.53 | 774 |
| SES Index represents ... |